6 research outputs found

    Look and Modify: Modification Networks for Image Captioning

    Full text link
    Attention-based neural encoder-decoder frameworks have been widely used for image captioning. Many of these frameworks generate the caption entirely from scratch, relying solely on global image features or object-detection region features. In this paper, we introduce a novel framework that learns to modify existing captions from a given framework by modeling the residual information: at each timestep the model learns what to keep, remove, or add to the existing caption, allowing it to focus fully on "what to modify" rather than on "what to predict". We evaluate our method on the COCO dataset, trained on top of several image captioning frameworks, and show that our model successfully modifies captions, yielding better captions with higher evaluation scores. Comment: Published in BMVC 2019
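    The keep/remove/add idea lends itself to a compact illustration. Below is a minimal sketch, not the paper's actual architecture: a single decoding step that conditions on image features and the previous word, with a learned gate that decides whether to keep the existing caption's token at this position or emit a new one. All module names, dimensions, and the gating form are illustrative assumptions.

```python
# Hedged sketch of a "modify an existing caption" decoding step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualModifyCell(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.keep_gate = nn.Linear(hidden_dim + embed_dim, 1)  # keep vs. modify decision
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, old_word, img_feat, state):
        # prev_word, old_word: (B,) token ids; img_feat: (B, feat_dim); state: (h, c)
        x = torch.cat([self.embed(prev_word), img_feat], dim=-1)
        h, c = self.lstm(x, state)
        p_keep = torch.sigmoid(self.keep_gate(torch.cat([h, self.embed(old_word)], dim=-1)))
        vocab_dist = F.softmax(self.out(h), dim=-1)
        old_dist = F.one_hot(old_word, vocab_dist.size(-1)).float()  # the existing caption's token
        # mix: keep the existing token with probability p_keep, otherwise predict a replacement
        return p_keep * old_dist + (1 - p_keep) * vocab_dist, (h, c)
```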

    Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks

    Full text link
    Natural Language Explanations (NLE) aim to supplement a model's prediction with human-friendly natural text. Existing NLE approaches involve training separate models for each downstream task. In this work, we propose Uni-NLX, a unified framework that consolidates all NLE tasks into a single, compact multi-task model using a unified training objective of text generation. Additionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of 144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of 123K samples for explaining the task of Visual Question Answering (VQA). Both datasets are derived by leveraging large language models (LLMs). By training on the 1M combined NLE samples, our single unified framework can simultaneously perform seven NLE tasks, including VQA, visual recognition, and visual reasoning tasks, with 7X fewer parameters, achieving performance comparable to the independent task-specific models of previous approaches and even outperforming them on certain tasks. Code is at https://github.com/fawazsammani/uni-nlx Comment: Accepted to ICCVW 2023
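    A rough sketch of the unifying idea: every NLE task is cast as a task-prefixed input text paired with an explanation text, the datasets are concatenated, and a single generation loss covers all tasks. The task names and prompt format below are assumptions for illustration, not the exact setup of the paper or repository.

```python
# Hedged sketch of a unified text-generation objective over multiple NLE tasks.
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class NLETask(Dataset):
    def __init__(self, task_name, pairs):
        self.task_name = task_name
        self.pairs = pairs  # list of (input_text, explanation_text)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        inp, expl = self.pairs[i]
        # one shared prompt format: the task prefix lets a single model handle all tasks
        return {"text_in": f"[{self.task_name}] {inp}", "text_out": expl}

# All tasks flow through one loader; one cross-entropy generation loss on text_out.
vqa_x = NLETask("vqa-x", [("question: what is the man doing?", "he is surfing because ...")])
act_x = NLETask("act-x", [("what activity is shown?", "mowing the lawn because ...")])
loader = DataLoader(ConcatDataset([vqa_x, act_x]), batch_size=2, shuffle=True)
```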

    Show, Edit and Tell: A Framework for Editing Image Captions

    Full text link
    Most image captioning frameworks generate captions directly from images, learning a mapping from visual features to natural language. However, editing existing captions can be easier than generating new ones from scratch. Intuitively, when editing captions, a model is not required to learn information that is already present in the caption (i.e., sentence structure), enabling it to focus on fixing details (e.g., replacing repetitive words). This paper proposes a novel approach to image captioning based on iterative adaptive refinement of an existing caption. Specifically, our caption-editing model consists of two sub-modules: (1) EditNet, a language module with an adaptive copy mechanism (Copy-LSTM) and a Selective Copy Memory Attention mechanism (SCMA), and (2) DCNet, an LSTM-based denoising auto-encoder. These components enable our model to directly copy from and modify existing captions. Experiments demonstrate that our new approach achieves state-of-the-art performance on the MS COCO dataset, both with and without sequence-level training. Comment: Accepted to CVPR 2020
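    The copy-and-modify idea can be illustrated with a small sketch. This is not the paper's Copy-LSTM or SCMA module, but a generic pointer/copy step under assumed shapes: the decoder mixes a vocabulary distribution with an attention-derived copy distribution over the tokens of the caption being edited.

```python
# Hedged sketch of a copy mechanism over an existing caption (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyDecoderStep(nn.Module):
    def __init__(self, vocab_size, hidden_dim=512):
        super().__init__()
        self.gen = nn.Linear(hidden_dim, vocab_size)
        self.copy_gate = nn.Linear(hidden_dim, 1)

    def forward(self, h, old_tokens, old_states):
        # h: (B, H) decoder state; old_tokens: (B, T) ids of the existing caption;
        # old_states: (B, T, H) encoded states of that caption.
        attn = F.softmax(torch.einsum("bh,bth->bt", h, old_states), dim=-1)  # pointer over old caption
        copy_dist = torch.zeros(h.size(0), self.gen.out_features, device=h.device)
        copy_dist.scatter_add_(1, old_tokens, attn)            # project attention onto vocabulary ids
        gen_dist = F.softmax(self.gen(h), dim=-1)
        p_copy = torch.sigmoid(self.copy_gate(h))              # how much to copy vs. generate
        return p_copy * copy_dist + (1 - p_copy) * gen_dist
```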

    Visualizing and Understanding Contrastive Learning

    Full text link
    Contrastive learning has revolutionized the field of computer vision, learning rich representations from unlabeled data that generalize well to diverse vision tasks. Consequently, it has become increasingly important to explain these approaches and understand their inner working mechanisms. Given that contrastive models are trained with interdependent and interacting inputs and aim to learn invariance through data augmentation, the existing methods for explaining single-image systems (e.g., image classification models) are inadequate, as they fail to account for these factors. Additionally, there is a lack of evaluation metrics designed to assess pairs of explanations, and no analytical studies have investigated the effectiveness of the different techniques used to explain contrastive learning. In this work, we design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images. We further adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations, and evaluate our proposed methods with these metrics. Finally, we present a thorough analysis of visual explainability methods for contrastive learning, establish their correlation with downstream tasks, and demonstrate the potential of our approaches to investigate their merits and drawbacks.
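    As a point of reference for what a pairwise explanation involves, here is a minimal sketch of a plain gradient saliency computed on the cosine similarity of an image pair; the encoder is an illustrative placeholder, and the paper's proposed methods and adapted metrics go well beyond this simple baseline.

```python
# Hedged sketch: explain a similarity score for a pair of images via input gradients.
import torch
import torch.nn.functional as F
import torchvision.models as models

encoder = models.resnet18(weights=None)   # stand-in for a contrastively trained encoder
encoder.fc = torch.nn.Identity()
encoder.eval()

img_a = torch.rand(1, 3, 224, 224, requires_grad=True)
img_b = torch.rand(1, 3, 224, 224, requires_grad=True)

sim = F.cosine_similarity(encoder(img_a), encoder(img_b))  # similarity for the pair
sim.sum().backward()

# Per-pixel saliency for each image in the pair: where does the similarity come from?
saliency_a = img_a.grad.abs().max(dim=1).values  # (1, 224, 224)
saliency_b = img_b.grad.abs().max(dim=1).values
```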

    Deep convolutional networks for magnification of DICOM Brain Images

    No full text
    Convolutional neural networks have recently achieved great success in Single Image Super-Resolution (SISR). SISR is the task of reconstructing a high-quality image from a low-resolution one. In this paper, we propose a deep Convolutional Neural Network (CNN) for the enhancement of Digital Imaging and Communications in Medicine (DICOM) brain images. The network learns an end-to-end mapping between the low- and high-resolution images. We first extract features from the image, where each new layer is connected to all previous layers. We then adopt residual learning and a mixture of convolutions to reconstruct the image. Our network is designed to work with grayscale images, since brain images are originally in grayscale. We further compare our method with previous works, trained on the same brain images, and show that our method outperforms them.
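    The two ingredients mentioned above, densely connected feature extraction and residual reconstruction, can be sketched as follows for single-channel (grayscale) inputs; layer counts, widths, and the assumption of a bicubic-upsampled input are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of dense feature extraction + global residual learning for SISR.
import torch
import torch.nn as nn

class DenseResidualSR(nn.Module):
    def __init__(self, channels=1, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(num_layers):
            # each new layer sees the concatenation of all previous feature maps
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1), nn.ReLU(inplace=True)))
            in_ch += growth
        self.reconstruct = nn.Conv2d(in_ch, channels, kernel_size=3, padding=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        # residual learning: predict the detail to add to the (already upsampled) input
        return x + self.reconstruct(torch.cat(feats, dim=1))

lr_upscaled = torch.rand(1, 1, 64, 64)    # bicubic-upsampled low-resolution slice
hr_estimate = DenseResidualSR()(lr_upscaled)
```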

    EEG Signal Analysis of Stroke Patients with Applications of Deep Learning

    No full text